This notebook will be dedicated to exploring details of the PISA 2012 dataset. PISA, in particular, is a "survey of students' skills and knowledge as they approach the end of compulsory education. It is not a conventional school test. Rather than examining how well students have learned the school curriculum, it looks at how well prepared they are for life beyond school" (Udacity, 2019).
Within this dataset we can find information for about 510,000 students. The PISA 2012 dataset includes information on mathematics, reading in the test language, and science.
Throughout the course of this notebook I will have these two questions in mind:
To begin, let's start off by assessing the dataset and cleaning any remaining issues.
# Import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
%matplotlib inline
# Read in the cleaned csv that was created in the wrangle_pisa notebook
pisa = pd.read_csv('pisa_df.csv')
# Set up variables for colors to be used in plotting
color1 = '#a7d7c5'
color2 = '#74b49b'
color3 = '#5c8d89'
color_male = '#ff8162'
color_female = '#ffcd60'
color_gends = ['#ffcd60', '#ff8162']
line = '#ff8000'
# How many rows and variables the dataset holds
pisa.shape
# What are the data types of the variables
pisa.dtypes
# See 10 examples of data in the dataset
pisa.sample(10)
# Descriptive statistics for each numeric variable
pisa.describe()
# The type and quantity of the educational levels for 'Education - Father'
pisa['Education - Father'].value_counts()
# The type and quantity of the educational levels for 'Education - Mother'
pisa['Education - Mother'].value_counts()
# Convert parental level of education into ordered categorical types
ordinal_var_dict = {'Education - Father': ['Early childhood', 'Primary', 'Lower secondary', 'Upper secondary', 'Post-secondary', 'Short-cycle tertiary', 'Bachelor’s or equivalent'],
'Education - Mother': ['Early childhood', 'Primary', 'Lower secondary', 'Upper secondary', 'Post-secondary', 'Short-cycle tertiary', 'Bachelor’s or equivalent']}
for var in ordinal_var_dict:
    ordered_var = pd.api.types.CategoricalDtype(ordered = True,
                                                categories = ordinal_var_dict[var])
    pisa[var] = pisa[var].astype(ordered_var)
pisa.shape
pisa['Student ID'].duplicated().sum()
pisa.drop_duplicates(inplace=True)
pisa.duplicated().sum()
pisa.shape
This cleaned version of the PISA 2012 dataset is composed of 43,715 rows, each of which represents one student. There are 18 selected variables, most of which are numeric. Two of the variables differ in that they are ordered categorical variables: the highest educational levels of the student's mother and father, sorted from the lowest level of education to the highest:
(least educated) -> (most educated)
<ISCED level 0> : Pre-primary education
<ISCED level 1> : Primary education or first stage of basic education
<ISCED level 2> : Lower secondary education or second stage of basic education
<ISCED level 3> : Upper secondary education
<ISCED level 4> : Post-secondary non-tertiary education
<ISCED level 5> : First stage of tertiary education
<ISCED level 6> : Second stage of tertiary education
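To illustrate what the ordered categorical type provides, here is a minimal sketch on a made-up column. The shortened level list and the sample values are assumptions for illustration, not data from the notebook.

```python
import pandas as pd

# Hypothetical miniature column; the real values live in pisa['Education - Father']
levels = ['Early childhood', 'Primary', 'Lower secondary', 'Upper secondary']
edu_dtype = pd.api.types.CategoricalDtype(categories=levels, ordered=True)

raw = pd.Series(['Primary', 'Upper secondary', 'Early childhood', 'primary'])
edu = raw.astype(edu_dtype)

# Ordered categories support comparisons against a level name
at_least_lower_secondary = edu >= 'Lower secondary'

# Labels that are not in the category list (here the lowercase 'primary') become NaN
n_unmatched = int(edu.isna().sum())

print(edu.min(), '|', edu.max())
print(at_least_lower_secondary.tolist())
print(n_unmatched)
```

Because unmatched labels silently turn into missing values, it can be worth comparing `value_counts()` before and after the conversion.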
The main feature that we will be exploring is the 'Average Total Score'.
To better understand the Average Total Score, I believe that 'Out-of-School Study Time - Total' and 'Learning time (minutes per week) - Total' will provide illuminating results. The common assumption is that the more homework a student completes, the better they will perform on tests, but a growing body of research suggests that homework time is not a good predictor of test success. Rather, I expect that the educational level of the parents and the number of books in the home will be better predictors of a student's test success.
We can start off by looking at the main feature of interest: the average total score.
In particular, let's first look at a standard-scale plot of this value to see its distribution.
# Histogram of Average Total Score
binsize = 20
bins = np.arange(0, pisa['Average Total Score'].max()+binsize, binsize)
plt.figure(figsize=[8, 5])
plt.hist(data = pisa, x = 'Average Total Score', bins = bins, color = color1)
plt.xlabel('Average Total Score')
plt.ylabel('Frequency')
plt.title('Frequency of Average Total Scores');
Here we can see that the distribution is approximately normal. This is not surprising, since bell curves are expected when it comes to student grades.
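One rough way to put a number on how bell-shaped a histogram is: sample skewness and excess kurtosis are both close to 0 for normal data. The sketch below uses synthetic scores as a stand-in for `pisa['Average Total Score']`. The helper name `normality_summary` and the mean and spread of the fake data are assumptions.

```python
import numpy as np
import pandas as pd

def normality_summary(values):
    """Return sample skewness and excess kurtosis; both are near 0 for normal data."""
    s = pd.Series(values).dropna()
    return {'skew': s.skew(), 'excess_kurtosis': s.kurt()}

# Synthetic stand-in for pisa['Average Total Score'] (mean and spread are made up)
rng = np.random.default_rng(0)
fake_scores = rng.normal(loc=500, scale=90, size=20_000)
summary = normality_summary(fake_scores)
print(summary)
```

On the real column, values far from 0 would suggest that the visual impression of normality needs a second look.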
We can now move onto the three scores that the total score is comprised of: Math, Reading, and Science.
# Histogram of Average Math Score
binsize = 20
bins = np.arange(0, pisa['Average Math Score'].max()+binsize, binsize)
plt.figure(figsize=[8, 5])
plt.hist(data = pisa, x = 'Average Math Score', bins=bins, color = color1)
plt.xlabel('Average Math Score')
plt.ylabel('Frequency')
plt.title('Frequency of Average Math Scores');
This distribution closely resembles that of the total score: it is distinctly bell-shaped.
# Histogram of Average Reading Score
binsize = 20
bins = np.arange(0, pisa['Average Reading Score'].max()+binsize, binsize)
plt.figure(figsize=[8, 5])
plt.hist(data = pisa, x = 'Average Reading Score', bins=bins, color = color1)
plt.xlabel('Average Reading Score')
plt.ylabel('Frequency')
plt.title('Frequency of Average Reading Scores');
Just as with the Math score, we can see the average Reading score is falling along a normal distribution.
# Histogram of Average Science Score
binsize = 20
bins = np.arange(0, pisa['Average Science Score'].max()+binsize, binsize)
plt.figure(figsize=[8, 5])
plt.hist(data = pisa, x = 'Average Science Score', bins=bins, color = color1)
plt.xlabel('Average Science Score')
plt.ylabel('Frequency')
plt.title('Frequency of Average Science Scores');
Just as with the Total, Math, and Reading scores, we can see the Science score also falls along a normal distribution.
We can now move onto the Study Time variables.
# Histogram of the Total Out-of-School Study Time
binsize = 2
bins = np.arange(0, pisa['Out-of-School Study Time - Total'].max()+binsize, binsize)
plt.figure(figsize=[8, 5])
plt.hist(data = pisa, x = 'Out-of-School Study Time - Total', color = color2, bins = bins)
plt.xlabel('Out-of-School Study Time - Total (h/week)')
plt.ylabel('Frequency')
plt.title('Frequency of Total Out-of-School Study Times');
From this histogram for the Total Out-of-School Study Time, we can see a strong right skew on this unimodal distribution. Due to the tail that extends past the peak, we should look at this variable on a smaller scale.
# Histogram of the Total Out-of-School Study Time
binsize = 1
bins = np.arange(0, pisa['Out-of-School Study Time - Total'].max()+binsize, binsize)
plt.figure(figsize=[8, 5])
plt.hist(data = pisa, x = 'Out-of-School Study Time - Total', color = color2, bins = bins)
plt.xlim(0,20)
plt.xlabel('Out-of-School Study Time - Total (h/week)')
plt.ylabel('Frequency')
plt.title('Frequency of Total Out-of-School Study Times');
The data for this distribution remains unimodal and quite consistent under a smaller scale.
Now we can look at each of the variables that have been used to create the Total Out-of-School Study Time: 'Out-of-School Study Time - Homework', 'Out-of-School Study Time - Guided Homework', 'Out-of-School Study Time - Personal Tutor', 'Out-of-School Study Time - Commercial Company', 'Out-of-School Study Time - With Parent'
# Histogram of the Out-of-School Study Time for Homework
binsize = 1
bins = np.arange(0, pisa['Out-of-School Study Time - Homework'].max()+binsize, binsize)
plt.figure(figsize=[8, 5])
plt.hist(data = pisa, x = 'Out-of-School Study Time - Homework', color = color2, bins = bins)
plt.xlabel('Out-of-School Study Time - Homework (h/week)')
plt.ylabel('Frequency')
plt.title('Frequency of Out-of-School Study Times for Homework');
# Histogram of the Out-of-School Study Time for Guided Homework
binsize = 1
bins = np.arange(0, pisa['Out-of-School Study Time - Guided Homework'].max()+binsize, binsize)
plt.figure(figsize=[8, 5])
plt.hist(data = pisa, x = 'Out-of-School Study Time - Guided Homework', color = color2, bins = bins)
plt.xlabel('Out-of-School Study Time - Guided Homework (h/week)')
plt.ylabel('Frequency')
plt.title('Frequency of Out-of-School Study Times for Guided Homework');
# Histogram of the Out-of-School Study Time with a Personal Tutor
binsize = 1
bins = np.arange(0, pisa['Out-of-School Study Time - Personal Tutor'].max()+binsize, binsize)
plt.figure(figsize=[8, 5])
plt.hist(data = pisa, x = 'Out-of-School Study Time - Personal Tutor', color = color2, bins = bins)
plt.xlabel('Out-of-School Study Time - Personal Tutor (h/week)')
plt.ylabel('Frequency')
plt.title('Frequency of Out-of-School Study Times with a Personal Tutor');
# Histogram of the Out-of-School Study Time with a Commercial Company
binsize = 1
bins = np.arange(0, pisa['Out-of-School Study Time - Commercial Company'].max()+binsize, binsize)
plt.figure(figsize=[8, 5])
plt.hist(data = pisa, x = 'Out-of-School Study Time - Commercial Company', color = color2, bins = bins)
plt.xlabel('Out-of-School Study Time - Commercial Company (h/week)')
plt.ylabel('Frequency')
plt.title('Frequency of Out-of-School Study Times with a Commercial Company');
# Histogram of the Out-of-School Study Time with a Parent
binsize = 1
bins = np.arange(0, pisa['Out-of-School Study Time - With Parent'].max()+binsize, binsize)
plt.figure(figsize=[8, 5])
plt.hist(data = pisa, x = 'Out-of-School Study Time - With Parent', color = color2, bins = bins)
plt.xlabel('Out-of-School Study Time - With Parent (h/week)')
plt.ylabel('Frequency')
plt.title('Frequency of Out-of-School Study Times with a Parent');
Each of the above histograms for Out-of-School Study Time reflects what we saw in the Total Out-of-School Study Time histogram. They are all strongly right-skewed unimodal distributions, which is not much of a surprise: students generally put in some Study Time outside of school, but the number of students who can dedicate more time drops off quickly thereafter.
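The histogram cells above all repeat the same bin computation. A small helper could take over that step, sketched here with made-up study hours. The function name `fixed_width_bins` is hypothetical.

```python
import numpy as np

def fixed_width_bins(values, binsize, start=0):
    """Bin edges of equal width from `start` up to just past the largest value."""
    top = np.nanmax(values)
    return np.arange(start, top + binsize, binsize)

# Hypothetical weekly study hours standing in for a study-time column
hours = np.array([0, 1, 1, 2, 3, 3, 3, 7, 12, 30])
edges = fixed_width_bins(hours, binsize=2)
counts, _ = np.histogram(hours, bins=edges)
print(edges)
print(counts)
```

The same `edges` array can be passed as `bins` to `plt.hist`, so the counts printed here match the bar heights of the plot.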
Now we can move on to look at the Learning Time distributions.
# Histogram of the Total Learning Time
binsize = 100
bins = np.arange(0, pisa['Learning Time - Total'].max()+binsize, binsize)
plt.figure(figsize=[8, 5])
plt.hist(data = pisa, x = 'Learning Time - Total', color = color3, bins=bins)
plt.xlim(0, 2500)
plt.xlabel('Learning Time - Total (mins/week)')
plt.ylabel('Frequency')
plt.title('Frequency of Total Learning Times');
Although slightly skewed to the right, this distribution is much more normal if we compare to the Out-of-School Study Time distribution. But to understand Learning Time, we must look into each of the subjects.
# Histogram of the Learning Time for Mathematics
binsize = 25
bins = np.arange(0, pisa['Learning Time - Mathematics'].max()+binsize, binsize)
plt.figure(figsize=[8, 5])
plt.hist(data = pisa, x = 'Learning Time - Mathematics', color = color3, bins=bins)
plt.xlim(0, 700)
plt.xlabel('Learning Time - Mathematics (mins/week)')
plt.ylabel('Frequency')
plt.title('Frequency of Learning Times for Mathematics');
This distribution for Mathematics related Learning Time generally matches the unimodal and normal distribution that we saw for the Total Learning Time, although it is more sporadic in nature.
# Histogram of the Learning Time for the Test Language
binsize = 25
bins = np.arange(0, pisa['Learning Time - Test Language'].max()+binsize, binsize)
plt.figure(figsize=[8, 5])
plt.hist(data = pisa, x = 'Learning Time - Test Language', color = color3, bins=bins)
plt.xlim(0, 700)
plt.xlabel('Learning Time - Test Language (mins/week)')
plt.ylabel('Frequency')
plt.title('Frequency of Learning Times for the Test Language');
Once again, the distribution for Test Language reflects the same shape that we saw for both Mathematics and the Total Learning Time.
# Histogram of the Learning Time for Science
binsize = 25
bins = np.arange(0, pisa['Learning Time - Science'].max()+binsize, binsize)
plt.figure(figsize=[8, 5])
plt.hist(data = pisa, x = 'Learning Time - Science', color = color3, bins=bins)
plt.xlim(0, 700)
plt.xlabel('Learning Time - Science (mins/week)')
plt.ylabel('Frequency')
plt.title('Frequency of Learning Times for Science');
This distribution, on the other hand, shows a different story. For Science we can see a clear right skew.
Since all of the Learning Time variables have values that are beyond 600 minutes, and these values might distort our later plots, we should analyze them and determine if it makes sense to disregard them.
# Select high outliers for each Learning Time variable, using criteria eyeballed from the plots
high_outliers_math = (pisa['Learning Time - Mathematics'] > 600)
print(high_outliers_math.sum())
print(pisa.loc[high_outliers_math,:])
high_outliers_lang = (pisa['Learning Time - Test Language'] > 600)
print(high_outliers_lang.sum())
print(pisa.loc[high_outliers_lang,:])
high_outliers_sci = (pisa['Learning Time - Science'] > 600)
print(high_outliers_sci.sum())
print(pisa.loc[high_outliers_sci,:])
Since the number of outliers is so low and they do not bring exceptionally relevant information to the analysis, it will be better if we continue without them.
# Remove outliers (~ negates a boolean mask; unary minus is not supported on boolean data)
pisa = pisa.loc[~high_outliers_math & ~high_outliers_lang & ~high_outliers_sci,:]
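Before dropping rows, it can help to check what share of the data a cut-off removes. The following is a minimal sketch on invented minutes-per-week values. The toy frame is an assumption, and only the 600-minute threshold is taken from the notebook.

```python
import pandas as pd

# Hypothetical minutes-per-week values standing in for the Learning Time columns
df = pd.DataFrame({
    'Learning Time - Mathematics': [200, 250, 900, 180, 220],
    'Learning Time - Science': [150, 700, 160, 170, 140],
})

threshold = 600  # same cut-off as in the notebook, chosen by eye from the plots
is_outlier = (df > threshold).any(axis=1)   # True if any column exceeds the cut-off

share_removed = is_outlier.mean()
trimmed = df.loc[~is_outlier]               # ~ negates a boolean mask

print(f'{is_outlier.sum()} of {len(df)} rows flagged ({share_removed:.1%})')
```

Reporting the removed share next to the filter makes it easier to judge whether the trimming could affect later results.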
# Re-plotting the distributions of Learning Times
fig, ax = plt.subplots(nrows=3, figsize = [18,20])
variables = ['Learning Time - Mathematics', 'Learning Time - Test Language', 'Learning Time - Science']
for i in range(len(variables)):
    var = variables[i]
    ax[i].hist(data = pisa, x = var, color=color3)
    ax[i].set_xlabel('{} (mins/week)'.format(var))
    ax[i].set_ylabel('Frequency')
    ax[i].set_title('{}'.format(var))
plt.show()
Last but not least, we still have the parental education levels to analyze.
# The ordinal variable's distribution for both Mother's and Father's Education
fig, ax = plt.subplots(nrows=2, figsize = [18,18])
default_color = sb.color_palette()[0]
sb.countplot(data = pisa, x = 'Education - Father', color = color_male, ax = ax[0])
sb.countplot(data = pisa, x = 'Education - Mother', color = color_female, ax = ax[1])
plt.show()
Here it shows that the students in this dataset typically have parents with higher educational levels. Short-cycle Tertiary education is the most common level by a clear margin for both mothers and fathers, while parents with just Early Childhood education have the fewest children in this dataset.
For 'Average Total Score', the distribution was strikingly normal. However, this was expected to an extent, since student grades typically fall along a bell curve. As a result, no unusual points stood out for this variable, nor did any stand out for the three scores that resulted in the total score. Therefore, no transformations were necessary to make sense of the data.
The secondary features investigated were Study Times, Learning Times, and Parental Education.
For Study Times, the total had a strong right skew, as did each of the Study Times that make up the total. To better understand this feature, we zoomed in on the 0 to 20 hour range with one-hour bins to check whether the distribution was in fact unimodal and to look for other irregularities. It remained unimodal and consistent with the wider view.
As for the Learning Time, this data clearly had outliers, so for each of the Learning Time variables, values over 600 minutes were excluded. This was done to focus on more typical students, and so that later plots will not be distorted by these exceptionally dedicated students.
The Parental Education variables are weighted somewhat heavily toward parents with higher educational levels, but considering the plots we will run, this should not have a great impact, so we will leave them as they are.
To start off, let's look at the correlations between each of the Scores, the Total Out-of-School Study Time, and the Total Learning Time to see if the amount of time dedicated to a subject has an influence on the score, and how strongly the Scores are correlated with one another. This will help us answer the question of whether or not there is a relationship between the amount of time a student dedicates to learning and their score.
numeric_vars = ['Average Math Score', 'Average Reading Score', 'Average Science Score', 'Average Total Score', 'Out-of-School Study Time - Total', 'Learning Time - Total']
# Correlation plot
plt.figure(figsize = [8, 5])
sb.heatmap(pisa[numeric_vars].corr(), annot = True, fmt = '.3f',
cmap = 'BrBG', center = 0)
plt.show()
Considering the correlations between the Scores, the Total Out-of-School Study Time and Total Learning Time, we can see that the Total Learning Time is slightly better correlated with the scores than the Total Out-of-School Study Time, with the Average Reading Score being the exception.
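The heatmap above reports Pearson correlations, which measure linear association and can understate a monotonic relationship when one variable is strongly skewed. As a cross-check, a rank-based (Spearman) coefficient can be computed as Pearson's coefficient on the ranks. The data below is synthetic, and the column names `hours` and `score` are placeholders.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 2_000
# Synthetic, right-skewed "study hours" and a score that rises with log-hours
hours = rng.exponential(scale=5, size=n)
score = 450 + 40 * np.log1p(hours) + rng.normal(scale=30, size=n)
df = pd.DataFrame({'hours': hours, 'score': score})

pearson = df.corr().loc['hours', 'score']
# Spearman's coefficient is Pearson's coefficient computed on the ranks
spearman = df.rank().corr().loc['hours', 'score']
print(round(pearson, 3), round(spearman, 3))
```

If the two coefficients differ noticeably on the real columns, the relationship is probably monotonic but not linear.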
To better understand the relationship between the Scores and the Learning Time, let's look at the breakdown of the Learning Time per subject.
score_learn_vars = ['Average Math Score', 'Average Reading Score', 'Average Science Score',
'Average Total Score', 'Learning Time - Mathematics',
'Learning Time - Test Language', 'Learning Time - Science',
'Learning Time - Total']
# correlation plot
plt.figure(figsize = [8, 5])
sb.heatmap(pisa[score_learn_vars].corr(), annot = True, fmt = '.3f',
cmap = 'BrBG', center = 0)
plt.show()
Interestingly, we can see that the Learning Times for Mathematics and the Test Language show almost no correlation with any of the Scores, in contrast to the Learning Time for Science.
We can look at these variables now through another perspective: seeing the scatter plot relationships between them.
samples = np.random.choice(pisa.shape[0], 500, replace = False)
# Select by position: after rows were dropped, index labels no longer match positions
pisa_samp = pisa.iloc[samples,:]
g = sb.PairGrid(data = pisa_samp, vars = score_learn_vars)
g = g.map_diag(plt.hist, bins = 20, color='#ffcd60');
g.map_offdiag(plt.scatter, color = color1);
As expected, we can clearly see a strong positive correlation between each of the Scores. As for the relationships between the Learning Times, a positive relationship is visible between each of them, albeit not a very strong one, with the exception of some outliers.
When it comes to the relationship between the Scores and Learning Times, this plot shows no apparent relationship between the amount of time a student spends learning a subject and the Score that they receive.
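A note on drawing the 500-row sample for a pair grid: once rows have been dropped, the index has gaps, so integer positions and index labels are no longer the same thing. The sketch below shows two ways to sample that do not depend on the labels. The toy frame is an assumption.

```python
import numpy as np
import pandas as pd

# Toy frame whose index has gaps, as happens after dropping rows
df = pd.DataFrame({'score': np.arange(10.0)}, index=[0, 1, 2, 5, 6, 9, 11, 12, 15, 20])

# Option 1: let pandas draw the rows; random_state makes the draw repeatable
samp_a = df.sample(n=4, random_state=42)

# Option 2: draw integer positions and select by position with .iloc
rng = np.random.default_rng(42)
positions = rng.choice(df.shape[0], size=4, replace=False)
samp_b = df.iloc[positions]

print(samp_a.index.tolist(), samp_b.index.tolist())
```

Either option returns exactly the requested number of existing rows, whereas label-based selection with `.loc` can fail on labels that were removed.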
score_study_vars = ['Average Math Score', 'Average Reading Score', 'Average Science Score',
'Average Total Score', 'Out-of-School Study Time - Homework',
'Out-of-School Study Time - Guided Homework',
'Out-of-School Study Time - Personal Tutor',
'Out-of-School Study Time - Commercial Company',
'Out-of-School Study Time - With Parent',
'Out-of-School Study Time - Total']
# correlation plot
plt.figure(figsize = [8, 5])
sb.heatmap(pisa[score_study_vars].corr(), annot = True, fmt = '.3f',
cmap = 'BrBG', center = 0)
plt.show()
The results of this correlation plot are noteworthy in that it indicates that study time in terms of Guided Homework, with Personal Tutor, with a Commercial Company, and with a Parent have no positive influence on the score of a student. This could be related to the fact that the students who do need this amount of help are already the ones who struggle with grades, but since we have no information on previous Scores of said students, we cannot explore this theory any further for now.
We can, however, look deeper into the role of Homework in the student's Score.
score_study_vars = ['Average Math Score', 'Average Reading Score', 'Average Science Score',
'Average Total Score', 'Out-of-School Study Time - Homework']
samples = np.random.choice(pisa.shape[0], 500, replace = False)
# Select by position: after rows were dropped, index labels no longer match positions
pisa_samp = pisa.iloc[samples,:]
g = sb.PairGrid(data = pisa_samp, vars = score_study_vars)
g = g.map_diag(plt.hist, bins = 20, color='#ffcd60');
g.map_offdiag(plt.scatter, color = color1);
Although the relationship between Homework Study Time and the various Scores is weak, we can see that the more time a student spends on Homework, the higher their Score tends to be. But this relationship only really holds up to a Score of about 450. So students at the bottom of the Score ranking who spend time doing Homework may be able to move up toward the average Scores, while the higher Scores seem to be generally unaffected.
Lastly, let's look at the relationship between the Study Time and Learning Time variables to see if they correlate strongly with one another in any interesting way.
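A scatter plot of a 500-row sample makes a weak trend hard to judge by eye. One alternative is to group students into bands of homework hours and compare the mean Score per band. The sketch below uses synthetic data in which hours and scores are unrelated by construction. The band edges and the column names are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 5_000
# Synthetic data: hours are right-skewed, scores are unrelated noise by construction
df = pd.DataFrame({
    'homework_hours': rng.exponential(scale=5, size=n),
    'total_score': rng.normal(loc=480, scale=90, size=n),
})

bins = [0, 2, 5, 10, 20, np.inf]
df['hours_band'] = pd.cut(df['homework_hours'], bins=bins, right=False)

# Mean score and group size per band; small groups deserve less trust
by_band = df.groupby('hours_band', observed=True)['total_score'].agg(['mean', 'count'])
print(by_band)
```

Applied to the real columns, a rise in the band means that levels off would support the pattern described above, and flat means would not.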
time_vars = ['Out-of-School Study Time - Homework',
'Out-of-School Study Time - Guided Homework',
'Out-of-School Study Time - Personal Tutor',
'Out-of-School Study Time - Commercial Company',
'Out-of-School Study Time - With Parent',
'Learning Time - Mathematics',
'Learning Time - Test Language',
'Learning Time - Science']
# correlation plot
plt.figure(figsize = [8, 5])
sb.heatmap(pisa[time_vars].corr(), annot = True, fmt = '.3f',
cmap = 'BrBG', center = 0)
plt.show()
When it comes to the Study Times and Learning Times, no relationship is visible between the two categories, and there are barely any relationships within each category. So, we cannot say that certain students study both within school and outside of school more than others. In general for this section, we cannot see much of an influence from Time spent learning on Scores.
Now we can see our next set of factors that might influence the Score of a student:
To start off, let's look at the distribution of each level of education and the frequency of each.
g = sb.FacetGrid(data = pisa, col = 'Education - Mother');
g.map(plt.hist, 'Average Total Score', color = color_female);
Here we can see that the children in this dataset frequently have mothers with a Short-cycle Tertiary Education. In terms of Scores for each level, children with mothers who have just Early Childhood education perform much worse, with distribution that does not even reach the Score of 600. Meanwhile, the highest level of Bachelor's or equivalent is slightly left skewed and goes past the 600 mark.
g = sb.FacetGrid(data = pisa, col = 'Education - Father');
g.map(plt.hist, 'Average Total Score', color = color_male);
The same can be said for the fathers' education levels, except that here we have more fathers with a Bachelor's or equivalent education.
Next we can look at the distribution for each of these levels to see the range and medians better.
plt.figure(figsize=[18,8])
sb.violinplot(data = pisa,
x = 'Education - Father',
y = 'Average Total Score',
color = color_male)
plt.title('Average Total Score Across Education Levels of Father');
Interestingly, the spread is quite large for the children of higher educated fathers. In fact, it appears that the child who performed worst had a father with Short-cycle Tertiary education. Meanwhile, the children with parents who have only Early Childhood education seem to have a much smaller range and exist to a much greater extent around the median.
plt.figure(figsize=[18,8])
sb.violinplot(data = pisa,
x = 'Education - Mother',
y = 'Average Total Score',
color = color_female)
plt.title('Average Total Score Across Education Levels of Mother');
The violin plot for the Mother's Education is more along the lines of what we expect, with the median growing from one level to the next, and each of which has a reasonable spread.
But to see the extent to which the outliers play a role, we can look at the same data with box plots.
plt.figure(figsize=[18,8])
sb.boxplot(data = pisa,
x = 'Education - Father',
y = 'Average Total Score',
color = color_male);
plt.title('Average Total Score Across Education Levels of Father');
Once again we can see the student who performs lowest overall is an outlier for the Short-cycle Tertiary level, and in general the same trend exists.
plt.figure(figsize=[18,8])
sb.boxplot(data = pisa,
x = 'Education - Mother',
y = 'Average Total Score',
color = color_female)
plt.title('Average Total Score Across Education Levels of Mother');
Here we can see that for the lower education levels for the mother, the students are generally achieving lower grades, but there are a good amount of high score outliers. While on the other half of the educational levels, there is a tendency for high grades with a few low score outliers.
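For reference, the points that a box plot draws individually are, with the default whisker setting of 1.5, those lying more than 1.5 times the interquartile range beyond the quartiles. The sketch below shows that rule on made-up scores. The helper name `tukey_fences` is hypothetical.

```python
import numpy as np

def tukey_fences(values, k=1.5):
    """Lower and upper limits beyond which a box plot marks points as outliers."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Invented scores with one very low and one very high value
scores = np.array([300, 420, 450, 470, 480, 500, 510, 530, 560, 800])
low, high = tukey_fences(scores)
flagged = scores[(scores < low) | (scores > high)]
print(low, high, flagged)
```

Since the limits depend on each group's own quartiles, a Score flagged as an outlier at one education level may sit well inside the whiskers at another.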
# Score averages of students vs education levels of Father
plt.figure(figsize=[18,8])
sb.pointplot(data = pisa,
x = 'Education - Father',
y = 'Average Total Score',
color = color_male)
# Score averages of students vs education levels of Mother
sb.pointplot(data = pisa,
x = 'Education - Mother',
y = 'Average Total Score',
color = color_female)
plt.title('Average Total Score Across Education Levels of Parents')
# Set legend
plt.legend(labels=['Fathers Education', 'Mothers Education'])
# https://stackoverflow.com/questions/23698850/manually-set-color-of-points-in-legend
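ax = plt.gca()
leg = ax.get_legend()
# Matplotlib 3.7 renamed Legend.legendHandles to legend_handles; support both
handles = leg.legend_handles if hasattr(leg, 'legend_handles') else leg.legendHandles
handles[0].set_color(color_male)
handles[1].set_color(color_female);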
In general, we can see that the student Scores grow with the education level of the parent, regardless of the gender of the parent, until a point where it seems to plateau.
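The numbers behind a point plot can also be tabulated: the mean Score per education level with an interval around it. The sketch below uses invented scores and a normal-approximation interval, which is a simple stand-in and not necessarily what seaborn computes for its error bars. The level list and column names are assumptions.

```python
import numpy as np
import pandas as pd

levels = ['Primary', 'Lower secondary', 'Upper secondary']
rng = np.random.default_rng(3)
# Synthetic scores whose mean rises by level; the numbers are invented
frames = [pd.DataFrame({'education': lvl, 'score': rng.normal(430 + 30 * i, 80, size=400)})
          for i, lvl in enumerate(levels)]
df = pd.concat(frames, ignore_index=True)
df['education'] = pd.Categorical(df['education'], categories=levels, ordered=True)

stats = df.groupby('education', observed=True)['score'].agg(['mean', 'sem', 'count'])
# Normal-approximation 95% interval around each group mean
stats['ci_low'] = stats['mean'] - 1.96 * stats['sem']
stats['ci_high'] = stats['mean'] + 1.96 * stats['sem']
print(stats.round(1))
```

Overlapping intervals between neighbouring levels on the real data would be one sign of the plateau mentioned above.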
Now we can move towards looking at the gender of the child as well.
plt.figure(figsize=[10,8])
sb.boxplot(data = pisa,
x = 'Gender',
y = 'Average Total Score',
palette = color_gends);
If we look at the role that Gender plays on the Score, the range seems to match. However, the males seem to dip lower with their outliers.
plt.figure(figsize=[18,7])
sb.countplot(data = pisa, x = 'Education - Father', hue = 'Gender', palette = color_gends);
Here we can see how many female and male children have parents that fall into the educational levels. It's generally about the same, except for Bachelor's or equivalent, where there are many more males than females.
Now we can look at whether gender plays a role in the Score of a student.
# Create a subset to better see comparison plots
np.random.seed(2018)
sample = np.random.choice(pisa.shape[0], 200, replace=False)
# Select by position: after rows were dropped, index labels no longer match positions
pisa_subset = pisa.iloc[sample]
g = sb.FacetGrid(data = pisa_subset, hue = 'Gender', palette = color_gends, height=5)
g.map(sb.regplot, 'Average Total Score', 'Average Reading Score', fit_reg = False)
plt.legend();
Here we can see that females have a tendency for higher Reading Scores, and males have a tendency for higher Math Scores.
g = sb.FacetGrid(data = pisa_subset, hue = 'Gender', palette = color_gends, height=5)
g.map(sb.regplot, 'Average Total Score', 'Average Science Score', fit_reg = False)
plt.legend();
The same separation cannot be made when comparing Math to Science for male and female. They seem to overlap completely.
g = sb.FacetGrid(data = pisa_subset, hue = 'Gender', palette = color_gends, height=5)
g.map(sb.regplot, 'Average Total Score', 'Average Math Score', fit_reg = False)
plt.legend();
Once again, females slightly outperform males when it comes to the Reading Score.
g = sb.FacetGrid(data = pisa_subset, hue = 'Gender', palette = color_gends, height=5)
g.map(sb.regplot, 'Average Total Score', 'Out-of-School Study Time - Homework', fit_reg = False)
plt.legend();
When it comes to the one Out-of-School Study Time variable that had any noteworthy correlation from before, the Homework variable here has a negligible relationship to Score, as well as Gender.
g = sb.FacetGrid(data = pisa_subset, hue = 'Gender', palette = color_gends, height=5)
g.map(sb.regplot, 'Average Science Score', 'Learning Time - Science', fit_reg = False)
plt.legend();
As with the Out-of-School Study Time variable, we can look at the Science Score vs. the Science Learning Time here, since it was the strongest relationship. Once again, the effect of Gender is not visible.
In this section it became visible that the Scores were less influenced by Out-of-School Study Time and Learning Time than expected. For Learning Time in school, we saw that Science had a more positive correlation with each of the Scores than the Math and Reading Learning Times.
The scores were, however, strongly associated with the educational level of the parents. We saw that the higher the level of education of either the mother or father, the higher the student's score is likely to be, on average at least. Also, we saw that the female students slightly outperformed the male students on the Average Reading Score, but generally the females and males performed the same throughout.
Interestingly enough, Out-of-School Study Time and Learning Time were not as significant as I had expected. In particular, we can see that the only significant and positively correlated Out-of-School Study Time variable was Homework, and the rest were correlated in a weak negative way to the student's score.
To start off this section of exploration, let's continue the box plots and gender comparisons from before.
plt.figure(figsize=[18,10])
sb.boxplot(data = pisa,
x = 'Education - Father',
y = 'Average Total Score',
hue = 'Gender',
palette = color_gends);
plt.figure(figsize=[18,10])
sb.boxplot(data = pisa,
x = 'Education - Mother',
y = 'Average Total Score',
hue = 'Gender',
palette = color_gends);
plt.show();
Here we answer one of the original questions: whether there are differences in achievement based on gender or parental education levels. For both Father and Mother, we can see a negligible difference between males and females for all levels. The widest gap between the two genders exists for the Primary education level for both Father and Mother, but the proportion of students in this category is small, and the medians are nevertheless similar enough.
And when it comes to the educational levels of the parents, those definitely play a role in how successful a student tends to be. There is of course a dramatic spread in both directions, along with outliers, but it seems that the median Score for students is closely related to the educational level of either Mother or Father.
Now we can observe the relationship between Learning Times and their respective subjects.
# Faceted scatter plots on levels of father's education
g = sb.FacetGrid(data = pisa, col = 'Education - Father', col_wrap = 4, height = 5)
g.map(sb.regplot, 'Average Science Score', 'Learning Time - Science',
color = color1, x_jitter = 0.3,
scatter_kws = {'alpha' : 1/20},
line_kws={"color": color3})
g.set_xlabels('Average Science Score')
g.set_ylabels('Learning Time(mins/week)- Science')
plt.show()
Previously, we saw that the amount of Learning Time for Science looked promising when it came to its correlation to its corresponding Score, the Average Science Score, at least in comparison to the other pairs. However, when we look at the regression plots we see here, we can see there might be a separation between the students. The line of regression appears to be showing a negative correlation between Learning Time for Science and the Average Science Score for the students whose Fathers achieved Primary, Lower secondary, and Upper secondary education. On the other hand, with the 3 highest levels of education in our dataframe, Post-secondary, Short-cycle tertiary, and Bachelor's or equivalent, we can see a positive correlation. This might indicate that the higher the education of the father, the more likely that Science related Learning Time in school will produce a higher grade.
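Judging the sign of a regression line in a small facet by eye can be unreliable, so the fitted slope per group can be computed directly. The sketch below mirrors the plots' orientation (Score on the x-axis, Learning Time on the y-axis) on synthetic data. The group labels, the slopes built into the data, and the helper names are all assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)

def make_group(label, slope, n=500):
    """Synthetic group: learning minutes depend linearly on score plus noise."""
    score = rng.normal(500, 90, size=n)
    minutes = 200 + slope * (score - 500) + rng.normal(0, 20, size=n)
    return pd.DataFrame({'education': label, 'score': score, 'minutes': minutes})

df = pd.concat([make_group('Primary', -0.2), make_group('Bachelor', 0.2)],
               ignore_index=True)

def fitted_slope(group):
    # Degree-1 least-squares fit of minutes on score; polyfit returns [slope, intercept]
    slope, _intercept = np.polyfit(group['score'], group['minutes'], deg=1)
    return slope

slopes = {label: fitted_slope(g) for label, g in df.groupby('education')}
print({k: round(v, 3) for k, v in slopes.items()})
```

On the real data, slopes close to zero in either direction would mean the apparent split between lower and higher education levels should be read with caution.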
# Faceted scatter plots on levels of mother's education
g = sb.FacetGrid(data = pisa, col = 'Education - Mother', col_wrap = 4, height = 5)
g.map(sb.regplot, 'Average Science Score', 'Learning Time - Science', color = color_female, x_jitter = 0.3,
scatter_kws = {'alpha' : 1/20},
line_kws={"color": line})
g.set_xlabels('Average Science Score')
g.set_ylabels('Learning Time(mins/week)- Science')
plt.show()
To add support to the argument that more Science related Learning Time in school goes along with a better Science Score when the parental education is Post-secondary or higher, we can see that the results for the Mother's education match.
Considering this, it would be interesting if we saw similar results for the Mathematics and Reading related scores.
# Faceted scatter plots on levels of father's education
g = sb.FacetGrid(data = pisa, col = 'Education - Father', col_wrap = 4, height = 5)
g.map(sb.regplot, 'Average Math Score', 'Learning Time - Mathematics', color = color1, x_jitter = 0.3,
scatter_kws = {'alpha' : 1/20},
line_kws={"color": color2})
g.set_xlabels('Average Math Score')
g.set_ylabels('Learning Time (mins/week) - Math')
plt.show()
# Faceted scatter plots on levels of mother's education
g = sb.FacetGrid(data = pisa, col = 'Education - Mother', col_wrap = 4, height = 5)
g.map(sb.regplot, 'Average Math Score', 'Learning Time - Mathematics', color = color_female, x_jitter = 0.3,
scatter_kws = {'alpha' : 1/20},
line_kws={"color": line})
g.set_xlabels('Average Math Score')
g.set_ylabels('Learning Time (mins/week) - Math')
plt.show()
For both mother and father, the results are unremarkable. We cannot draw the same conclusion as we did for the Science Learning Time and Score relationship. Here, the amount of Learning Time for Mathematics does not seem to play a role in a student's Math Score.
# Faceted scatter plots on levels of father's education
g = sb.FacetGrid(data = pisa, col = 'Education - Father', col_wrap = 4, height = 5)
g.map(sb.regplot, 'Average Reading Score', 'Learning Time - Test Language', color = color1, x_jitter = 0.3,
scatter_kws = {'alpha' : 1/20},
line_kws={"color": color2})
g.set_xlabels('Average Reading Score')
g.set_ylabels('Learning Time (mins/week) - Test Language')
plt.show()
# Faceted scatter plots on levels of mother's education
g = sb.FacetGrid(data = pisa, col = 'Education - Mother', col_wrap = 4, height = 5)
g.map(sb.regplot, 'Average Reading Score', 'Learning Time - Test Language', color = color_female, x_jitter = 0.3,
scatter_kws = {'alpha' : 1/20},
line_kws={"color": line})
g.set_xlabels('Average Reading Score')
g.set_ylabels('Learning Time (mins/week) - Test Language')
plt.show()
Just as we saw for the Mathematics Learning Time and Score, we can see the same for Reading Score and Learning Time of the Test Language. There are no clear trends and we cannot conclude that the Learning Time plays a role in the success in the Reading Score.
So for the Learning Times, we can conclude that Science-related Learning Time had the strongest relationship with its corresponding Score, and the other two are negligible.
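The comparison between the three subjects can also be summarised with one correlation coefficient per pair. This is a minimal sketch under the assumption that the column names match those used in the plots above; `SUBJECT_PAIRS` and `pair_correlations` are hypothetical names introduced here for illustration.

```python
import pandas as pd

# Column pairs taken from the plots above: (learning time, matching score)
SUBJECT_PAIRS = {
    'Science': ('Learning Time - Science', 'Average Science Score'),
    'Mathematics': ('Learning Time - Mathematics', 'Average Math Score'),
    'Reading': ('Learning Time - Test Language', 'Average Reading Score'),
}

def pair_correlations(df, pairs=SUBJECT_PAIRS):
    """Pearson correlation for each (time, score) pair.

    Series.corr ignores rows where either value is missing.
    """
    return pd.Series(
        {subject: df[time_col].corr(df[score_col])
         for subject, (time_col, score_col) in pairs.items()},
        name='pearson_r',
    )
```

The function can be applied to the full dataframe with `pair_correlations(pisa)`, or to a subset such as `pisa[pisa['Education - Father'] == 'Primary']` to look at a single education level.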
Now we can move on to Out-of-School Study Time. Previously, there were very few promising results from the analysis, so let's see if breaking it down by parental education level changes the picture.
# Faceted scatter plots on levels of father's education
g = sb.FacetGrid(data = pisa, col = 'Education - Father', col_wrap = 4, height = 5)
g.map(sb.regplot, 'Average Total Score', 'Out-of-School Study Time - Total',
color = color2,
x_jitter = 0.3,
scatter_kws = {'alpha' : 1/20},
line_kws={"color": color3})
g.set_xlabels('Average Total Score')
g.set_ylabels('Out-of-School Study Time (h/week) - Total')
plt.show()
# Faceted scatter plots on levels of mother's education
g = sb.FacetGrid(data = pisa, col = 'Education - Mother', col_wrap = 4, height = 5)
g.map(sb.regplot, 'Average Total Score', 'Out-of-School Study Time - Total',
color = color_female,
x_jitter = 0.3,
scatter_kws = {'alpha' : 1/20},
line_kws={"color": line})
g.set_xlabels('Average Total Score')
g.set_ylabels('Out-of-School Study Time (h/week) - Total')
plt.show()
Here we can see the Total Out-of-School Study Time vs. the Average Total Score. It's very clear that there is no meaningful relationship between these two. We can look into each of the variables that made up the Total Out-of-School Study Time to see if there are any observable relationships.
# Faceted scatter plots on levels of father's education
g = sb.FacetGrid(data = pisa, col = 'Education - Father', col_wrap = 4, height = 5)
g.map(sb.regplot, 'Average Total Score', 'Out-of-School Study Time - Guided Homework',
color = color2,
x_jitter = 0.3,
scatter_kws = {'alpha' : 1/20},
line_kws={"color": color3})
g.set_xlabels('Average Total Score')
g.set_ylabels('Out-of-School Study Time (h/week) - Guided Homework')
plt.show()
# Faceted scatter plots on levels of mother's education
g = sb.FacetGrid(data = pisa, col = 'Education - Mother', col_wrap = 4, height = 5)
g.map(sb.regplot, 'Average Total Score', 'Out-of-School Study Time - Guided Homework',
color = color_female,
x_jitter = 0.3,
scatter_kws = {'alpha' : 1/20},
line_kws={"color": line})
g.set_xlabels('Average Total Score')
g.set_ylabels('Out-of-School Study Time (h/week) - Guided Homework')
plt.show()
The relationship between Guided Homework Study Time and the Total Score is not promising. In fact, we see a subtle negative correlation for every level of education for both fathers and mothers.
# Faceted scatter plots on levels of father's education
g = sb.FacetGrid(data = pisa, col = 'Education - Father', col_wrap = 4, height = 5)
g.map(sb.regplot, 'Average Total Score', 'Out-of-School Study Time - Personal Tutor',
color = color2,
x_jitter = 0.3,
scatter_kws = {'alpha' : 1/20},
line_kws={"color": color3})
g.set_xlabels('Average Total Score')
g.set_ylabels('Out-of-School Study Time (h/week) - Personal Tutor')
plt.show()
# Faceted scatter plots on levels of mother's education
g = sb.FacetGrid(data = pisa, col = 'Education - Mother', col_wrap = 4, height = 5)
g.map(sb.regplot, 'Average Total Score', 'Out-of-School Study Time - Personal Tutor',
color = color_female,
x_jitter = 0.3,
scatter_kws = {'alpha' : 1/20},
line_kws={"color": line})
g.set_xlabels('Average Total Score')
g.set_ylabels('Out-of-School Study Time (h/week) - Personal Tutor')
plt.show()
The same can be said for Personal Tutor time and Score. This could of course be because students who need more time with Personal Tutors are already the ones who struggle, but that claim is a little too large for this data analysis.
# Faceted scatter plots on levels of father's education
g = sb.FacetGrid(data = pisa, col = 'Education - Father', col_wrap = 4, height = 5)
g.map(sb.regplot, 'Average Total Score', 'Out-of-School Study Time - Commercial Company',
color = color2,
x_jitter = 0.3,
scatter_kws = {'alpha' : 1/20},
line_kws={"color": color3})
g.set_xlabels('Average Total Score')
g.set_ylabels('Out-of-School Study Time (h/week) - Commercial Company')
plt.show()
# Faceted scatter plots on levels of father's education
g = sb.FacetGrid(data = pisa, col = 'Education - Father', col_wrap = 4, height = 5)
g.map(sb.regplot, 'Average Total Score', 'Out-of-School Study Time - With Parent',
color = color2,
x_jitter = 0.3,
scatter_kws = {'alpha' : 1/20},
line_kws={"color": color3})
g.set_xlabels('Average Total Score')
g.set_ylabels('Out-of-School Study Time (h/week) - With Parent')
plt.show()
For students' Study Time with either a Commercial Company or a Parent, we can see the same trend that we saw for Guided Homework and Personal Tutor, so it is not necessary to repeat the plots for the mother's educational levels as well as the father's. Once again we can see a tiny negative correlation, indicating that more of this kind of Study Time does not guarantee a higher Score.
And last but not least, the most promising variable of the Out-of-School Study Time grouping: Homework.
# Faceted scatter plots on levels of father's education
g = sb.FacetGrid(data = pisa, col = 'Education - Father', col_wrap = 4, height = 5)
g.map(sb.regplot, 'Average Total Score', 'Out-of-School Study Time - Homework', color = color2,
x_jitter = 0.3,
scatter_kws = {'alpha' : 1/20},
line_kws={"color": color3})
g.set_xlabels('Average Total Score')
g.set_ylabels('Out-of-School Study Time (h/week) - Homework')
plt.show()
Here we have a very clear relationship that indicates that the more time a student spends on Homework, the higher their Total Score will be. This is applicable for each educational level for the father, and it is quite a big contrast to all the other Out-of-School Study Time variables.
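The contrast between Homework and the other Out-of-School Study Time variables can be laid out in a single table of correlations, one row per education level and one column per study-time component. The sketch below is an illustration only: `component_corr_by_group` is a hypothetical helper, and the column names are assumed to be the ones used in the plots above.

```python
import pandas as pd

# Study-time components as named in the plots above
STUDY_COMPONENTS = [
    'Out-of-School Study Time - Homework',
    'Out-of-School Study Time - Guided Homework',
    'Out-of-School Study Time - Personal Tutor',
    'Out-of-School Study Time - Commercial Company',
    'Out-of-School Study Time - With Parent',
]

def component_corr_by_group(df, group_col, component_cols, score_col):
    """Pearson correlation of each component with score_col, per group level."""
    rows = {}
    for level, sub in df.groupby(group_col, observed=True):
        # corrwith returns one coefficient per component column
        rows[level] = sub[component_cols].corrwith(sub[score_col])
    # Transpose so that education levels are rows and components are columns
    return pd.DataFrame(rows).T
```

For example, `component_corr_by_group(pisa, 'Education - Father', STUDY_COMPONENTS, 'Average Total Score')` would give one row per level of the father's education, which makes it easy to compare the sign of the Homework column against the others.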
# Faceted scatter plots on levels of mother's education
g = sb.FacetGrid(data = pisa, col = 'Education - Mother', col_wrap = 4, height = 5)
g.map(sb.regplot, 'Average Total Score', 'Out-of-School Study Time - Homework', color = color_female,
x_jitter = 0.3,
scatter_kws = {'alpha' : 1/20},
line_kws={"color": line})
g.set_xlabels('Average Total Score')
g.set_ylabels('Out-of-School Study Time (h/week) - Homework')
plt.show()
Just as was the case for the father, the mother's levels of education all indicate the same positive correlation between Homework-related Study Time and Total Score.
As a final analysis, we can look at the father's level of education in comparison to the three Scores that make up the Total Score.
# Faceted scatter plots on levels of father's education
g = sb.FacetGrid(data = pisa, col = 'Education - Father', col_wrap = 4, height = 5)
g.map(sb.regplot, 'Average Science Score', 'Out-of-School Study Time - Homework', color = color2,
x_jitter = 0.3,
scatter_kws = {'alpha' : 1/20},
line_kws={"color": color3})
g.set_xlabels('Average Science Score')
g.set_ylabels('Out-of-School Study Time (h/week) - Homework')
plt.show()
# Faceted scatter plots on levels of father's education
g = sb.FacetGrid(data = pisa, col = 'Education - Father', col_wrap = 4, height = 5)
g.map(sb.regplot, 'Average Math Score', 'Out-of-School Study Time - Homework', color = color2,
x_jitter = 0.3,
scatter_kws = {'alpha' : 1/20},
line_kws={"color": color3})
g.set_xlabels('Average Math Score')
g.set_ylabels('Out-of-School Study Time (h/week) - Homework')
plt.show()
# Faceted scatter plots on levels of father's education
g = sb.FacetGrid(data = pisa, col = 'Education - Father', col_wrap = 4, height = 5)
g.map(sb.regplot, 'Average Reading Score', 'Out-of-School Study Time - Homework', color = color2,
x_jitter = 0.3,
scatter_kws = {'alpha' : 1/20},
line_kws={"color": color3})
g.set_xlabels('Average Reading Score')
g.set_ylabels('Out-of-School Study Time (h/week) - Homework')
plt.show()
For each of these, we can see the same trend. There is a positive correlation between the amount of time a student puts into Homework-related Study Time and the Score they receive.
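Because reported homework hours tend to be discrete and skewed, a rank-based correlation is one way to check that the positive trend does not hinge on a few extreme values. This check is not part of the original analysis. The sketch below computes a Spearman correlation as the Pearson correlation of ranks; the function names are hypothetical, and the column names are assumed from the plots above.

```python
import pandas as pd

def rank_correlation(df, x_col, y_col):
    """Spearman rank correlation, computed as Pearson correlation of ranks."""
    # Drop incomplete rows first so both columns are ranked on the same rows
    clean = df[[x_col, y_col]].dropna()
    return clean[x_col].rank().corr(clean[y_col].rank())

def homework_rank_correlations(df, homework_col, score_cols):
    """Rank correlation between homework time and each score column."""
    return pd.Series(
        {col: rank_correlation(df, homework_col, col) for col in score_cols},
        name='spearman_rho',
    )
```

A call such as `homework_rank_correlations(pisa, 'Out-of-School Study Time - Homework', ['Average Science Score', 'Average Math Score', 'Average Reading Score'])` would return one coefficient per score. Small positive values would agree with the plots while also indicating how weak the relationship is.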
Throughout this section, we investigated further what kind of effect parental education has on the scores of the students. In particular, we started off by checking whether students' scores differed by gender in relation to their parental level of education. For both the mother's and the father's level of education, the genders were consistent apart from very minor differences.
Then, we continued on to see the relationship between Learning Times and their respective subjects. As we saw in the bivariate analysis, Learning Time spent on Science had the best outcomes, but there was a catch. I will continue this topic in the question below. As for the rest, the relationship was negligible and no relationship could be established.
And finally, we looked at the relationship between the Out-of-School Study Times and the Average Total Scores. This was equally negligible for all categories except for one: Homework. We continued on to see Homework in comparison to each of the scores that the Average Total Score was composed of, and the positive correlation persisted. I would still classify it as a weak relationship, but it nevertheless was there.
The most notable finding was the difference in students' scores across parental levels of education when comparing to Learning Time for Science. It showed that although Learning Time for Science seems like a variable that would increase a student's score in Science, we cannot assume that is the case for students in all circumstances. We found that for students with parents of lower educational levels, spending more time in school learning Science-related topics did not have the positive correlation that we saw for students with parents of higher educational levels. Therefore, students who spent more time learning Science in school only showed a visible benefit when their parents had post-secondary education or higher.